Efficient and effective ER with progressive blocking

Authors

Abstract

Blocking is a mechanism to improve the efficiency of entity resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness–efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as guidance at every round. We formally prove that pBlocking converges efficiently ($$O(n \log^2 n)$$ time complexity, where n is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5$$\times$$ and 60%, respectively, improving the overall F-score of the entire ER process by up to 60%.
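The feedback loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' pBlocking algorithm and carries none of its complexity guarantees. It assumes token blocking as the traditional bootstrap method, a per-round comparison budget, and a block score equal to the observed match rate. All names (`token_blocks`, `progressive_blocking`, `is_match`, `rounds`, `budget`) and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(records):
    """Bootstrap step (assumed): one block per token shared by 2+ records."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for tok in text.lower().split():
            blocks[tok].add(rid)
    # A block with a single record produces no candidate pairs.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def progressive_blocking(records, is_match, rounds=3, budget=4):
    """Toy loop: compare a few pairs per round, then re-score the blocks."""
    blocks = token_blocks(records)
    # Initial prior (assumption): smaller blocks are more promising.
    score = {tok: 1.0 / len(ids) for tok, ids in blocks.items()}
    seen, matches = {}, set()
    for _ in range(rounds):
        # Score each uncompared pair by the best block that contains it.
        cand = {}
        for tok, ids in blocks.items():
            for pair in combinations(sorted(ids), 2):
                if pair not in seen:
                    cand[pair] = max(cand.get(pair, 0.0), score[tok])
        if not cand:
            break
        # Spend this round's limited budget on the highest-scored pairs.
        for pair in sorted(cand, key=lambda p: (-cand[p], p))[:budget]:
            seen[pair] = is_match(*pair)
            if seen[pair]:
                matches.add(pair)
        # Feedback: a block's new score is its observed match rate so far.
        for tok, ids in blocks.items():
            checked = [seen[p] for p in combinations(sorted(ids), 2) if p in seen]
            if checked:
                score[tok] = sum(checked) / len(checked)
    return matches, score


# Hypothetical sample data: records 1-2 and 3-4 refer to the same entities.
records = {
    1: "apple iphone 12",
    2: "iphone 12 apple",
    3: "samsung galaxy s21",
    4: "galaxy s21 samsung",
    5: "apple watch",
}
entity = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c"}
matches, score = progressive_blocking(records, lambda x, y: entity[x] == entity[y])
```

In this toy run the broad `apple` block ends with a low score because most of its compared pairs turn out to be non-matches, while narrow blocks such as `iphone` keep a high score. That is the kind of signal a feedback loop can use to prune or refine blocks.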

Similar resources

Reduction Effect of Effective Bandwidth and Blocking Rate with Dispersion

This paper analytically studies the performance improvement obtained with traffic dispersion for QoS-guaranteed applications. In the analysis, we suppose that connection admission control based on the effective bandwidth is performed in a packet network. Numerical results show that traffic dispersion can greatly reduce both the total effective bandwidth required by a source and the blocking rate of sources.

An Effective Hybrid Genetic Algorithm for Hybrid Flow Shops with Sequence Dependent Setup Times and Processor Blocking

Hybrid flow-shop or flexible flow shop problems have remained subject of intensive research over several years. Hybrid flow-shop problems overcome one of the limitations of the classical flow-shop model by allowing parallel processors at each stage of task processing. In many papers the assumptions are generally made that there is unlimited storage available between stages and the setup times a...

Efficient Progressive Skyline Computation

In this paper, we focus on the retrieval of a set of interesting answers called the skyline from a database. Given a set of points, the skyline comprises the points that are not dominated by other points. A point dominates another point if it is as good or better in all dimensions and better in at least one dimension. We present two novel algorithms, Bitmap and Index, to compute the skyline of ...

Effective and efficient reuse with software libraries

Research in software engineering has shown that software reuse positively affects the competitiveness of an organization: the productivity of the development team is increased, the time-to-market is reduced, and the overall quality of the resulting software is improved. Today’s code repositories on the Internet provide a large number of reusable software libraries with a variety of functionalit...

Journal

Journal title: The VLDB Journal

Year: 2021

ISSN: 0949-877X, 1066-8888

DOI: https://doi.org/10.1007/s00778-021-00656-7